[SPARK-12913] [SQL] Improve performance of stat functions #10960

davies · 2016-01-27T23:34:24Z

As benchmarked and discussed here: https://github.com/apache/spark/pull/10786/files#r50038294, a declarative aggregate function benefits from codegen and can be much faster than an imperative one.

davies · 2016-01-27T23:38:31Z

cc @mengxr

SparkQA · 2016-01-27T23:49:43Z

Test build #50240 has started for PR 10960 at commit 61edd5e.

davies · 2016-01-28T06:55:35Z

...st/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala

Creating a Cast() here is very expensive

SparkQA · 2016-01-28T08:56:55Z

Test build #50265 has finished for PR 10960 at commit 3c8d737.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2016-01-28T18:06:16Z

...st/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala

Those if branches are important to save computation for low-order statistics. Even if we won't use CentralMomentAgg for second-order statistics, it is still good to keep them.
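To make the point about branches concrete, here is a minimal plain-Scala sketch of a one-pass central-moment update. It is not the code in this PR: `Moments`, `update`, and the `momentOrder` parameter are hypothetical names, and plain `Double` arithmetic stands in for Catalyst expressions. The higher-order terms are only computed when the caller asks for them.

```scala
// Running state: count, mean, and the central moment sums M2, M3, M4.
final case class Moments(n: Double, mean: Double, m2: Double, m3: Double, m4: Double)

object Moments {
  val empty: Moments = Moments(0.0, 0.0, 0.0, 0.0, 0.0)

  // momentOrder (hypothetical) is the highest moment the caller needs: 2, 3, or 4.
  def update(s: Moments, x: Double, momentOrder: Int): Moments = {
    val n = s.n + 1.0
    val delta = x - s.mean
    val deltaN = delta / n
    val mean = s.mean + deltaN
    val term1 = delta * deltaN * s.n // increment to M2, uses the old count
    val m2 = s.m2 + term1
    // Third-order term is skipped when only variance/stddev is requested.
    val m3 =
      if (momentOrder >= 3) s.m3 + term1 * deltaN * (n - 2.0) - 3.0 * deltaN * s.m2
      else 0.0
    // Fourth-order term is skipped unless kurtosis-like statistics are requested.
    val m4 =
      if (momentOrder >= 4) {
        val deltaN2 = deltaN * deltaN
        s.m4 + term1 * deltaN2 * (n * n - 3.0 * n + 3.0) +
          6.0 * deltaN2 * s.m2 - 4.0 * deltaN * s.m3
      } else 0.0
    Moments(n, mean, m2, m3, m4)
  }
}
```

Folding a sequence with `momentOrder = 2` touches only the count, mean, and M2, which is the saving the branches are meant to preserve for variance and stddev.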

mengxr · 2016-01-28T18:14:27Z

@davies Did you get a chance to test whole-stage codegen with higher-order statistics like skewness? If it works, the cleanest solution would be to make CentralMomentAgg declarative and then have all existing univariate summary statistics call it.

SparkQA · 2016-01-28T20:12:41Z

Test build #50294 has finished for PR 10960 at commit ae83955.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- abstract class CentralMomentAgg(child: Expression) extends DeclarativeAggregate
- case class Kurtosis(child: Expression) extends CentralMomentAgg(child)
- case class Skewness(child: Expression) extends CentralMomentAgg(child)
- case class Echo(child: Expression) extends UnaryExpression

SparkQA · 2016-01-28T22:09:42Z

Test build #50297 has finished for PR 10960 at commit 1481bb4.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-01-28T23:15:13Z

Test build #50304 has finished for PR 10960 at commit 1b95b7c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2016-01-29T00:45:02Z

@davies side note: The JIRA number is wrong.

SparkQA · 2016-01-29T02:37:32Z

Test build #50322 has finished for PR 10960 at commit ae78e81.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-01-29T17:37:32Z

Test build #50384 has finished for PR 10960 at commit 1086810.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class SetDatabaseCommand(databaseName: String) extends RunnableCommand

davies · 2016-01-29T18:12:51Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/WindowQuerySuite.scala

The new implementation of Corr/Covar has better accuracy, so I updated the tests to match.

It would be nice if query tests could tolerate small numerical differences, but that is out of scope here.
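As a rough illustration of what tolerating small numerical differences could look like, here is a hedged sketch of a comparison helper. `ApproxCompare`, `approxEqual`, and the default tolerances are assumptions for illustration, not an existing Spark test utility.

```scala
object ApproxCompare {
  // Hypothetical helper: true when a and b agree within a relative tolerance,
  // with an absolute floor so values near zero still compare sensibly.
  def approxEqual(
      a: Double,
      b: Double,
      relTol: Double = 1e-9,
      absTol: Double = 1e-12): Boolean = {
    if (a.isNaN || b.isNaN) a.isNaN && b.isNaN       // NaN only matches NaN
    else if (a.isInfinite || b.isInfinite) a == b    // infinities must match exactly
    else {
      val scale = math.max(math.abs(a), math.abs(b))
      math.abs(a - b) <= math.max(absTol, relTol * scale)
    }
  }
}
```

A query test could then compare each floating-point column of the expected and actual rows with `approxEqual` instead of exact string equality.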

SparkQA · 2016-01-29T18:36:02Z

Test build #50386 has finished for PR 10960 at commit 383c193.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Corr(x: Expression, y: Expression) extends DeclarativeAggregate
- abstract class Covariance(x: Expression, y: Expression) extends DeclarativeAggregate

mengxr · 2016-02-02T07:33:26Z

...st/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala

ditto on EqualTo => ===

mengxr · 2016-02-02T07:40:12Z

@davies I made one pass. It would be nice to have a JIRA for checking query results with a tolerance for numerical differences, because the result might change (though unlikely) if we merge the partial results in a different order.
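The merge-order concern can be sketched in plain Scala. This assumes a partial state of (count, mean, M2) per partition and the standard pairwise combination formula; `Partial`, `of`, and `merge` are hypothetical names, not the classes in this PR.

```scala
// Partial aggregation state for one partition: count, mean, and the sum of
// squared deviations from the mean (M2).
final case class Partial(n: Double, mean: Double, m2: Double)

object Partial {
  val empty: Partial = Partial(0.0, 0.0, 0.0)

  // Two-pass summary of one partition (fine for a small in-memory example).
  def of(xs: Seq[Double]): Partial =
    if (xs.isEmpty) empty
    else {
      val n = xs.size.toDouble
      val mean = xs.sum / n
      Partial(n, mean, xs.map(x => (x - mean) * (x - mean)).sum)
    }

  // Combine two partial states without revisiting the underlying rows.
  def merge(a: Partial, b: Partial): Partial =
    if (a.n == 0.0) b
    else if (b.n == 0.0) a
    else {
      val n = a.n + b.n
      val delta = b.mean - a.mean
      Partial(
        n,
        a.mean + delta * b.n / n,
        a.m2 + b.m2 + delta * delta * a.n * b.n / n)
    }
}
```

Mathematically `merge(merge(p1, p2), p3)` and `merge(p1, merge(p2, p3))` are equal, but each grouping rounds differently in floating point, so the last few bits of the result can differ. That is why an exact comparison of query output is fragile here.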

davies · 2016-02-02T08:29:41Z

@mengxr Thanks for reviewing this. I should have addressed all your comments.

SparkQA · 2016-02-02T09:02:11Z

Test build #50560 has finished for PR 10960 at commit 9b74195.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class PrintToStderr(child: Expression) extends UnaryExpression

mengxr · 2016-02-02T09:58:16Z

test this please

mengxr · 2016-02-02T10:05:10Z

...st/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala

Sorry for some miscommunication. The previous inline comments are useful here because Lit(0.0) carries no information. The comments are not necessary when the variable names can clearly tell what they are. Please recover the inline comments for initial values.

davies · 2016-02-02

These initial values are all the same, so the order does not matter here. Do you still think we should keep those comments? Or should I change to use fill()?
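For reference, the two options under discussion look roughly like this. It is only a sketch: plain `Double` values stand in for the literal expressions, and the slot names `n`, `avg`, and `m2` are assumed for illustration.

```scala
object InitialValues {
  // Annotated form: each slot is labeled, since a bare 0.0 carries no information.
  val annotated: Seq[Double] = Seq(
    0.0, // n
    0.0, // avg
    0.0  // m2
  )

  // Compact form: shorter, but the reader must look elsewhere for the slot meanings.
  val filled: Seq[Double] = Seq.fill(3)(0.0)
}
```

Both produce the same buffer, so the choice is purely about readability.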

SparkQA · 2016-02-02T11:41:03Z

Test build #50565 has finished for PR 10960 at commit 9b74195.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class PrintToStderr(child: Expression) extends UnaryExpression

mengxr · 2016-02-02T18:45:01Z

LGTM pending Jenkins. It is great to see a 5x speedup!

SparkQA · 2016-02-02T18:48:59Z

Test build #50569 has finished for PR 10960 at commit 5f98588.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class PrintToStderr(child: Expression) extends UnaryExpression

davies · 2016-02-02T18:59:24Z

Merging this into master, thanks!

SparkQA · 2016-02-02T20:19:30Z

Test build #50572 has finished for PR 10960 at commit fe6fe50.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

improve stddev and variance

61edd5e

davies force-pushed the stddev branch from 70a7c7e to 61edd5e Compare January 27, 2016 23:38

davies reviewed Jan 28, 2016

improve accuracy

3c8d737

mengxr reviewed Jan 28, 2016

atomic mutable projection

ae83955

Davies Liu added 2 commits January 28, 2016 13:45

cleanup

1481bb4

reimplement corr/covariance

448e0e1

davies changed the title ~~[SPARK-12963] Improve performance of stddev/variance~~ [SPARK-12963] Improve performance of stat functions Jan 28, 2016

Merge branch 'master' of github.com:apache/spark into stddev

1b95b7c

Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregate.scala

davies force-pushed the stddev branch from ce86fa9 to 1b95b7c Compare January 28, 2016 22:14

fix tests

ae78e81

davies changed the title ~~[SPARK-12963] Improve performance of stat functions~~ [SPARK-12913] Improve performance of stat functions Jan 29, 2016

davies changed the title ~~[SPARK-12913] Improve performance of stat functions~~ [SPARK-12913] [SQL] Improve performance of stat functions Jan 29, 2016

Davies Liu added 2 commits January 29, 2016 08:06

Merge branch 'master' of github.com:apache/spark into stddev

1086810

Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregate.scala

update tests

383c193

disable udaf_covar_pop/udaf_covar_samp

ab32659

davies reviewed Jan 29, 2016

mengxr reviewed Feb 2, 2016

davies force-pushed the stddev branch from ef400a6 to 9b74195 Compare February 2, 2016 08:27

mengxr reviewed Feb 2, 2016

address comments

5f98588

davies force-pushed the stddev branch from 9b74195 to 5f98588 Compare February 2, 2016 16:42

address comments

fe6fe50

asfgit closed this in be5dd88 Feb 2, 2016

4 participants